library(topGO)
library(knitr)   # to use kable()
library(limma)   # to use vennDiagram()
library(ggplot2)
TAG <- params$TAG
VAR <- params$VAR
ANNOTATION <- list(genes = '../2019-07-26/genes/annotation.txt',
                   isoforms = '../2019-07-26/transcripts/annotation.txt')
TAGDIR <- paste0('../2020-01-08/', TAG, '/')

Introduction

This is the enrichment analysis for isoforms regulated by regime. Because I quantified the association of expression with regime in two different ways, this document has two sections. Section Variance reports the enrichment analysis performed with isoforms ordered by the proportion of expression variance explained by regime. Note that some isoforms with low variance may still be highly associated with regime, even if the fold change between levels of this factor is low. Section Differential expression uses an ordering of isoforms based on the significance of the differential expression between levels of regime, which does depend on fold change.

Reading the data

Functional annotation is in 2019-07-26. I will also upload two lists of isoforms, with either proportion of variance explained by regime or p-value of differential expression test.

tagVariance <- read.table(paste0(TAGDIR, VAR, '_variance.txt'))
tagPValue   <- read.table(paste0(TAGDIR, VAR, '_pvalue.txt'))
annotation  <- read.table(ANNOTATION[[TAG]], col.names = c('tagname', 'goterms'))

To initialize the topGOdata object, I will need the gene list as a named numeric vector, where the names are the isoforms identifiers and the numeric values, either the portion of variance explained by regime or the p-values of the differential expression test. The structure() function adds attributes to an object.

Variance <- structure(tagVariance[,1], names = row.names(tagVariance))
PValues  <- structure(tagPValue[,1],   names = row.names(tagPValue))
rm(tagVariance, tagPValue)

There are 33037 isoforms scored with a variance portion and a differential expression p-value. It should actually be the exact same isoforms. The annotation data frame has more than one GO term for every tag, separated by |. I need a named list.

head(annotation)
##          tagname
## 1 TCONS_00000002
## 2 TCONS_00000010
## 3 TCONS_00000011
## 4 TCONS_00000016
## 5 TCONS_00000017
## 6 TCONS_00000018
##                                                             goterms
## 1                                  GO:0008417|GO:0016020|GO:0006486
## 2                                  GO:0043130|GO:0005515|GO:0043161
## 3                       GO:0005840|GO:0015935|GO:0003735|GO:0006412
## 4 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 5 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 6 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
allgenes2GO <- strsplit(as.character(annotation$goterms), "|", fixed = TRUE)
names(allgenes2GO) <- annotation$tagname
rm(annotation)

There are 19735 isoforms with GO annotations. But the differential expression analysis includes many more isoforms. In order to include the not-annotated isoforms in the enrichment analysis, to see if annotated or not annotated isoforms are more or less often differentially expressed, I should attribute a GO term to them. According to [http://geneontology.org/docs/faq/] nowadays we express lack of annotation by annotating to the root nodes, i.e. GO:0008150 biological_process, GO:0003674 molecular_function, and GO:0005575 cellular_component.

for (tag in unique(c(names(PValues), names(Variance)))) {
   if (! tag %in% names(allgenes2GO)) {
      allgenes2GO <- append(allgenes2GO,
         structure(list(c("GO:0008150", "GO:0003674", "GO:0005575")), names = tag))
   }
}

Using differential expression p-values

Building the topGO object

Creation of a topGO dataset is documented in section 4 of topGO’s the user manual: https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf. I need to use the new command, and fill up the slots. The annot object must be a function that compiles “a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term.” There are several options, that you can check using help(annFUN.gene2GO), for example. The annFUN.gene2GO requires a gene2GO argument, which is the list of gene-to-GO terms I made before. The geneSelectionFun object is a function that selects the interesting (most significant) genes. It is required to perform Fisher’s exact test. The nodeSize is used to prune the GO hierarchy from the terms which have less than n annotated genes.

I generate three datasets, to analyse each of the three ontologies.

selection <- function(allScores) {return(allScores < 0.05)}
GOdata.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), numGenes),
   Num_GO_terms = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 26235 1582
MF 30580 791
CC 21616 390

Running the tests

There are more than one way to test for enrichment. Something that took me a while to understand is that not only there are different test statistics (Fisher’s exact test, Kolmogorov-Smirnov, and others) but also different algorithms: classic, elim, weight… The algorithms are ways to deal with the dependence structure among GO terms due to topology. Some algorithms are compatible with all statistics implemented in topGO. But the weight and the parentchild algorithms are only applicable to Fisher’s exact test. I am not interested in the classic algorithm, which treats GO nodes as independent, and therefore produces an excess of significant results. I will not use the Fisher’s exact test, because it dependes on an arbitrary threshold of significance on non-adjusted p-values.

BP.elim     <- runTest(GOdata.BP, algorithm = "elim",     statistic = "ks")
BP.weight01 <- runTest(GOdata.BP, algorithm = "weight01", statistic = "ks")
BP.lea      <- runTest(GOdata.BP, algorithm = "lea",      statistic = "ks")
MF.elim     <- runTest(GOdata.MF, algorithm = "elim",     statistic = "ks")
MF.weight01 <- runTest(GOdata.MF, algorithm = "weight01", statistic = "ks")
MF.lea      <- runTest(GOdata.MF, algorithm = "lea",      statistic = "ks")
CC.elim     <- runTest(GOdata.CC, algorithm = "elim",     statistic = "ks")
CC.weight01 <- runTest(GOdata.CC, algorithm = "weight01", statistic = "ks")
CC.lea      <- runTest(GOdata.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) length(score(X))),
   Significant = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1582 19
BP weight01 1582 19
BP lea 1582 25
MF elim 791 21
MF weight01 791 19
MF lea 791 31
CC elim 390 13
CC weight01 390 11
CC lea 390 21
rm(ResultsSummary)

Results

The topGO package does not pay much attention to what terms are significant because they are overrepresented and which ones are underrepresented among the most differentially expressed genes. I think it’s worth separating them, to facilitate the biological interpretation. Note that not all terms listed in the tables below are significant. The scores for the three methods (elim, weight01 and lea) are non-corrected p-values.

Biological process

orderedTerms <- names(sort(score(BP.weight01)))
significant.weight01 <- score(BP.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BP.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BP.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.pvalue.sigTerms <- sigTerms

BP.all <- GenTable(GOdata.BP, elim=BP.elim, weight01=BP.weight01, lea=BP.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   BP.all[BP.all$Significant > BP.all$Expected,],
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0006508 proteolysis 822 271 227.16 2 3.1e-05 2.1e-06 1.2e-05
2 GO:0007186 G protein-coupled receptor signaling pat… 302 107 83.46 1 3.4e-06 2.3e-06 3.4e-06
3 GO:0060271 cilium assembly 73 40 20.17 7 0.00036 2.3e-05 4.3e-07
4 GO:0006355 regulation of transcription, DNA-templat… 621 198 171.61 3 4.2e-05 5.6e-05 4.2e-05
5 GO:0007154 cell communication 1169 331 323.05 795 0.67468 0.00016 0.70887
6 GO:0060294 cilium movement involved in cell motilit… 7 7 1.93 4 0.00018 0.00018 0.00018
7 GO:0000079 regulation of cyclin-dependent protein s… 9 8 2.49 5 0.00030 0.00030 0.00030
8 GO:0006813 potassium ion transport 103 42 28.46 8 0.00044 0.00049 0.00044
9 GO:0005991 trehalose metabolic process 22 15 6.08 9 0.00105 0.00105 1.3e-05
10 GO:0035082 axoneme assembly 18 13 4.97 6 0.00035 0.00226 0.00035
12 GO:0005992 trehalose biosynthetic process 7 5 1.93 11 0.00315 0.00315 0.00315
14 GO:0006030 chitin metabolic process 103 42 28.46 13 0.00596 0.00516 0.00596
15 GO:0005975 carbohydrate metabolic process 360 121 99.49 104 0.06703 0.00642 0.05250
16 GO:0006367 transcription initiation from RNA polyme… 20 10 5.53 15 0.00809 0.00809 0.00809
17 GO:0042246 tissue regeneration 6 5 1.66 16 0.00815 0.00815 0.00815
18 GO:0006979 response to oxidative stress 40 15 11.05 17 0.00892 0.00892 0.00892
19 GO:0031146 SCF-dependent proteasomal ubiquitin-depe… 5 4 1.38 19 0.00988 0.00988 0.00988
kable(
   BP.all[BP.all$Significant < BP.all$Expected,],
   caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
11 GO:0044271 cellular nitrogen compound biosynthetic … 1365 361 377.22 1484 0.99135 0.00255 0.97186
13 GO:0006520 cellular amino acid metabolic process 208 57 57.48 502 0.43200 0.00321 0.61700
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term = Term(GOTERM[sigTerms]),
                 Definition = Definition(GOTERM[sigTerms]),
                 PValue=score(BP.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with',
                      VAR, 'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0006508 proteolysis The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. 0.0000021
GO:0007186 G protein-coupled receptor signaling pathway A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). 0.0000023
GO:0060271 cilium assembly The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. 0.0000229
GO:0006355 regulation of transcription, DNA-templated Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. 0.0000562
GO:0060294 cilium movement involved in cell motility Movement of cilia mediated by motor proteins that contributes to the movement of a cell. 0.0001825
GO:0000079 regulation of cyclin-dependent protein serine/threonine kinase activity Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. 0.0003024
GO:0006813 potassium ion transport The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0004926
GO:0005991 trehalose metabolic process The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0010535
GO:0035082 axoneme assembly The assembly and organization of an axoneme, the bundle of microtubules and associated proteins that forms the core of cilia (also called flagella) in eukaryotic cells and is responsible for their movements. 0.0022620
GO:0005992 trehalose biosynthetic process The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0031545
GO:0006030 chitin metabolic process The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0051557
GO:0006367 transcription initiation from RNA polymerase II promoter Any process involved in the assembly of the RNA polymerase II preinitiation complex (PIC) at an RNA polymerase II promoter region of a DNA template, resulting in the subsequent synthesis of RNA from that promoter. The initiation phase includes PIC assembly and the formation of the first few bonds in the RNA chain, including abortive initiation, which occurs when the first few nucleotides are repeatedly synthesized and then released. Promoter clearance, or release, is the transition between the initiation and elongation phases of transcription. 0.0080904
GO:0042246 tissue regeneration The regrowth of lost or destroyed tissues. 0.0081542
GO:0006979 response to oxidative stress Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of oxidative stress, a state often resulting from exposure to high levels of reactive oxygen species, e.g. superoxide anions, hydrogen peroxide (H2O2), and hydroxyl radicals. 0.0089181
GO:0031146 SCF-dependent proteasomal ubiquitin-dependent protein catabolic process The chemical reactions and pathways resulting in the breakdown of a protein or peptide by hydrolysis of its peptide bonds, initiated by the covalent attachment of ubiquitin, with ubiquitin-protein ligation catalyzed by an SCF (Skp1/Cul1/F-box protein) complex, and mediated by the proteasome. 0.0098752

I think the GO graph is useful to see the relationship among the significant terms. But too large graphs are impossible to read. I don’t know how to split the graph in meaningful subgraphs.

showSigOfNodes(GOdata.BP, score(BP.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 166 
## Number of Edges = 300 
## 
## $complete.dag
## [1] "A graph with 166 nodes."

This is just a example of the most significant GO term:

showGroupDensity(GOdata.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function

orderedTerms <- names(sort(score(MF.weight01)))
significant.weight01 <- score(MF.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MF.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MF.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.pvalue.sigTerms <- sigTerms

MF.all <- GenTable(GOdata.MF, elim=MF.elim, weight01=MF.weight01, lea=MF.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   MF.all[MF.all$Significant > MF.all$Expected,],
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0005509 calcium ion binding 638 255 177.46 1 2.1e-16 2.1e-16 2.1e-16
GO:0004930 G protein-coupled receptor activity 251 96 69.82 2 9.3e-06 6.5e-05 1.2e-06
GO:0043565 sequence-specific DNA binding 221 77 61.47 5 0.00068 9.2e-05 0.00068
GO:0008417 fucosyltransferase activity 42 20 11.68 3 0.00039 0.00039 0.00039
GO:0030248 cellulose binding 8 7 2.23 4 0.00054 0.00054 0.00054
GO:0003774 motor activity 277 93 77.05 158 0.14077 0.00097 0.14077
GO:0004181 metallocarboxypeptidase activity 39 22 10.85 6 0.00109 0.00109 0.00109
GO:0004555 alpha,alpha-trehalase activity 15 10 4.17 7 0.00113 0.00113 0.00113
GO:0008134 transcription factor binding 33 11 9.18 18 0.00804 0.00123 0.00804
GO:0005200 structural constituent of cytoskeleton 23 14 6.40 8 0.00150 0.00150 0.00150
GO:0004252 serine-type endopeptidase activity 143 53 39.78 10 0.00264 0.00264 0.00264
GO:0016462 pyrophosphatase activity 952 290 264.80 214 0.24145 0.00295 0.65612
GO:0008061 chitin binding 112 42 31.15 12 0.00379 0.00379 0.00379
GO:0004198 calcium-dependent cysteine-type endopept… 62 22 17.25 13 0.00390 0.00390 0.00390
GO:0004611 phosphoenolpyruvate carboxykinase activi… 7 6 1.95 14 0.00435 0.00435 0.00435
GO:0004601 peroxidase activity 59 21 16.41 115 0.07881 0.00437 0.07881
GO:0005267 potassium channel activity 80 33 22.25 11 0.00367 0.00593 0.00367
GO:0004983 neuropeptide Y receptor activity 11 7 3.06 16 0.00594 0.00594 0.00594
GO:0008047 enzyme activator activity 135 41 37.55 105 0.07008 0.00805 0.07008
GO:0030246 carbohydrate binding 83 40 23.09 28 0.01442 0.01061 0.04768
kable(
   MF.all[MF.all$Significant < MF.all$Expected,],
   caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
21 GO:0140098 catalytic activity, acting on RNA 278 68 77.33 505 0.73141 0.01160 0.73141
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MF.weight01)[sigTerms]),
      caption = paste('Molecular function terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular function terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005509 calcium ion binding Interacting selectively and non-covalently with calcium ions (Ca2+). 0.0000000
GO:0004930 G protein-coupled receptor activity Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. 0.0000650
GO:0043565 sequence-specific DNA binding Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. 0.0000918
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0003941
GO:0030248 cellulose binding Interacting selectively and non-covalently with cellulose. 0.0005411
GO:0004181 metallocarboxypeptidase activity Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. 0.0010874
GO:0004555 alpha,alpha-trehalase activity Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. 0.0011320
GO:0008134 transcription factor binding Interacting selectively and non-covalently with a transcription factor, any protein required to initiate or regulate transcription. 0.0012305
GO:0005200 structural constituent of cytoskeleton The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. 0.0014977
GO:0004252 serine-type endopeptidase activity Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). 0.0026383
GO:0008061 chitin binding Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0037881
GO:0004198 calcium-dependent cysteine-type endopeptidase activity Catalysis of the hydrolysis of nonterminal peptide bonds in a polypeptide chain by a mechanism using a cysteine residue at the enzyme active center, and requiring the presence of calcium. 0.0039050
GO:0004611 phosphoenolpyruvate carboxykinase activity Catalysis of the reaction: source of phosphate + oxaloacetate = phosphoenolpyruvate + CO2 + other reaction products. 0.0043484
GO:0005267 potassium channel activity Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. 0.0059322
GO:0004983 neuropeptide Y receptor activity Combining with neuropeptide Y to initiate a change in cell activity. 0.0059417
showSigOfNodes(GOdata.MF, score(MF.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 87 
## Number of Edges = 103 
## 
## $complete.dag
## [1] "A graph with 87 nodes."
showGroupDensity(GOdata.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component

orderedTerms <- names(sort(score(CC.weight01)))
significant.weight01 <- score(CC.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CC.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CC.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.pvalue.sigTerms <- sigTerms

CC.all <- GenTable(GOdata.CC, elim=CC.elim, weight01=CC.weight01, lea=CC.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
   CC.all[CC.all$Significant > CC.all$Expected,],
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0005576 extracellular region 197 83 54.27 1 1.3e-06 1.5e-05 1.3e-06
GO:0005743 mitochondrial inner membrane 44 23 12.12 2 3.6e-05 0.00032 3.6e-05
GO:0016459 myosin complex 57 27 15.70 5 0.00078 0.00078 0.00078
GO:0005929 cilium 52 27 14.33 6 0.00088 0.00102 0.00088
GO:0016021 integral component of membrane 1739 550 479.08 3 0.00046 0.00231 6.1e-05
GO:0005886 plasma membrane 324 120 89.26 7 0.00094 0.00485 0.00072
GO:0090575 RNA polymerase II transcription factor c… 47 19 12.95 29 0.02464 0.00505 0.02464
GO:0098800 inner mitochondrial membrane protein com… 27 13 7.44 25 0.02095 0.00515 0.02095
GO:0042555 MCM complex 12 8 3.31 10 0.00610 0.00610 0.00610
GO:0099512 supramolecular fiber 52 23 14.33 8 0.00353 0.00812 0.00353
GO:0031225 anchored component of membrane 8 5 2.20 13 0.00903 0.00903 0.00903
GO:0016020 membrane 3059 942 842.73 30 0.02530 0.01056 1.5e-07
GO:0030123 AP-3 adaptor complex 8 5 2.20 15 0.01204 0.01204 0.01204
kable(
   CC.all[CC.all$Significant < CC.all$Expected,],
   caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CC.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005576 extracellular region The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. 0.0000146
GO:0005743 mitochondrial inner membrane The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. 0.0003229
GO:0016459 myosin complex A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. 0.0007773
GO:0005929 cilium A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. 0.0010225
GO:0016021 integral component of membrane The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. 0.0023121
GO:0005886 plasma membrane The membrane surrounding a cell that separates the cell from its external environment. It consists of a phospholipid bilayer and associated proteins. 0.0048464
GO:0042555 MCM complex A hexameric protein complex required for the initiation and regulation of DNA replication. 0.0060991
GO:0099512 supramolecular fiber A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. 0.0081229
GO:0031225 anchored component of membrane The component of a membrane consisting of the gene products that are tethered to the membrane only by a covalently attached anchor, such as a lipid group that is embedded in the membrane. Gene products with peptide sequences that are embedded in the membrane are excluded from this grouping. 0.0090256
showSigOfNodes(GOdata.CC, score(CC.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 57 
## Number of Edges = 99 
## 
## $complete.dag
## [1] "A graph with 57 nodes."
showGroupDensity(GOdata.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Using the portion of variance explained by regime

Building the topGO object

I need to generate the topGO objects again, using the alternative gene ordering, based on the proportion of expression-level variance explained by regime. I miss a way to inform the topGOdata object that the score in this case is reversed, with respect to p-values: the higher it is, the more differentially expressed the gene is. To make sure that GO terms are tested in the same way than when using p-values, I will just reverse the proportion of variance explained by regime to its complement. Taking this into account, the subset of interesting genes (selection() function) must be defined as the lowest 10% scores, which are the 10% genes with largest portion of variance explained by regime.

selection <- function(allScores) {return(allScores <= quantile(allScores, probs = 0.10))}
GOdataVar.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

library(knitr)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), numGenes),
   Num_GO_terms = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 26235 1582
MF 30580 791
CC 21616 390

Running the tests

BPvar.elim     <- runTest(GOdataVar.BP, algorithm = "elim",     statistic = "ks")
BPvar.weight01 <- runTest(GOdataVar.BP, algorithm = "weight01", statistic = "ks")
BPvar.lea      <- runTest(GOdataVar.BP, algorithm = "lea",      statistic = "ks")
MFvar.elim     <- runTest(GOdataVar.MF, algorithm = "elim",     statistic = "ks")
MFvar.weight01 <- runTest(GOdataVar.MF, algorithm = "weight01", statistic = "ks")
MFvar.lea      <- runTest(GOdataVar.MF, algorithm = "lea",      statistic = "ks")
CCvar.elim     <- runTest(GOdataVar.CC, algorithm = "elim",     statistic = "ks")
CCvar.weight01 <- runTest(GOdataVar.CC, algorithm = "weight01", statistic = "ks")
CCvar.lea      <- runTest(GOdataVar.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) length(score(X))),
   Significant = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1582 24
BP weight01 1582 28
BP lea 1582 32
MF elim 791 30
MF weight01 791 29
MF lea 791 45
CC elim 390 15
CC weight01 390 15
CC lea 390 24
rm(ResultsSummary)

Results

Biological process

orderedTerms <- names(sort(score(BPvar.weight01)))
significant.weight01 <- score(BPvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BPvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BPvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.variance.sigTerms <- sigTerms

BPvar.all <- GenTable(GOdataVar.BP, elim=BPvar.elim, weight01=BPvar.weight01, lea=BPvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   BPvar.all[BPvar.all$Significant > BPvar.all$Expected,],
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0007186 G protein-coupled receptor signaling pat… 302 42 29.80 1 5.1e-08 1.0e-09 5.1e-08
2 GO:0006030 chitin metabolic process 103 19 10.16 2 6.1e-08 1.4e-07 6.1e-08
3 GO:0006508 proteolysis 822 125 81.12 6 0.00017 1.7e-07 7.9e-05
4 GO:0060271 cilium assembly 73 20 7.20 4 0.00012 8.0e-05 6.9e-07
6 GO:0006355 regulation of transcription, DNA-templat… 621 64 61.28 3 0.00011 0.00023 0.00011
7 GO:0006813 potassium ion transport 103 20 10.16 7 0.00034 0.00034 2.8e-05
8 GO:0071805 potassium ion transmembrane transport 20 7 1.97 15 0.00401 0.00079 0.00401
9 GO:0005991 trehalose metabolic process 22 10 2.17 9 0.00080 0.00080 8.5e-06
10 GO:0007154 cell communication 1169 116 115.36 1292 0.96294 0.00085 0.98254
11 GO:0060294 cilium movement involved in cell motilit… 7 4 0.69 11 0.00155 0.00155 0.00155
12 GO:0005992 trehalose biosynthetic process 7 5 0.69 12 0.00177 0.00177 0.00177
14 GO:0006814 sodium ion transport 108 14 10.66 13 0.00266 0.00266 0.00266
16 GO:0003341 cilium movement 22 8 2.17 14 0.00400 0.00400 0.00013
17 GO:0007017 microtubule-based process 389 59 38.39 240 0.14772 0.00411 0.14772
19 GO:0034220 ion transmembrane transport 306 49 30.20 25 0.01019 0.00531 0.00092
20 GO:0005975 carbohydrate metabolic process 360 55 35.53 129 0.06969 0.00535 0.05990
22 GO:0000079 regulation of cyclin-dependent protein s… 9 2 0.89 19 0.00570 0.00570 0.00570
23 GO:0055085 transmembrane transport 1062 152 104.80 8 0.00054 0.00605 0.00015
24 GO:0042246 tissue regeneration 6 2 0.59 20 0.00624 0.00624 0.00624
kable(
   BPvar.all[BPvar.all$Significant < BPvar.all$Expected,],
   caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
5 GO:0007156 homophilic cell adhesion via plasma memb… 31 3 3.06 5 0.00016 0.00016 0.00016
13 GO:0044271 cellular nitrogen compound biosynthetic … 1365 99 134.70 1127 0.88493 0.00239 0.97663
15 GO:0006520 cellular amino acid metabolic process 208 16 20.53 1035 0.84021 0.00297 0.98763
18 GO:0006464 cellular protein modification process 1369 112 135.10 1418 0.98480 0.00417 0.97305
21 GO:0006486 protein glycosylation 106 6 10.46 122 0.06463 0.00547 0.06463
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(BPvar.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0007186 G protein-coupled receptor signaling pathway A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). 0.0000000
GO:0006030 chitin metabolic process The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0000001
GO:0006508 proteolysis The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. 0.0000002
GO:0060271 cilium assembly The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. 0.0000796
GO:0007156 homophilic cell adhesion via plasma membrane adhesion molecules The attachment of a plasma membrane adhesion molecule in one cell to an identical molecule in an adjacent cell. 0.0001597
GO:0006355 regulation of transcription, DNA-templated Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. 0.0002253
GO:0006813 potassium ion transport The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0003356
GO:0071805 potassium ion transmembrane transport A process in which a potassium ion is transported from one side of a membrane to the other. 0.0007875
GO:0005991 trehalose metabolic process The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0008019
GO:0060294 cilium movement involved in cell motility Movement of cilia mediated by motor proteins that contributes to the movement of a cell. 0.0015464
GO:0005992 trehalose biosynthetic process The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0017746
GO:0006814 sodium ion transport The directed movement of sodium ions (Na+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0026564
GO:0003341 cilium movement The directed, self-propelled movement of a cilium. 0.0039975
GO:0000079 regulation of cyclin-dependent protein serine/threonine kinase activity Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. 0.0057002
GO:0055085 transmembrane transport The process in which a solute is transported across a lipid bilayer, from one side of a membrane to the other. 0.0060495
GO:0042246 tissue regeneration The regrowth of lost or destroyed tissues. 0.0062426
GO:0007160 cell-matrix adhesion The binding of a cell to the extracellular matrix via adhesion molecules. 0.0071326
GO:0016539 intein-mediated protein splicing The removal of an internal amino acid sequence (an intein) from a protein during protein maturation; the excision of inteins is precise and the N- and C-terminal exteins are joined by a normal peptide bond. Protein splicing involves 4 nucleophilic displacements by the 3 conserved splice junction residues. 0.0075827
GO:0006821 chloride transport The directed movement of chloride into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0080310
showSigOfNodes(GOdataVar.BP, score(BPvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 172 
## Number of Edges = 305 
## 
## $complete.dag
## [1] "A graph with 172 nodes."

Below I plot variance portion, but for the termo found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function

orderedTerms <- names(sort(score(MFvar.weight01)))
significant.weight01 <- score(MFvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MFvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MFvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.variance.sigTerms <- sigTerms

MFvar.all <- GenTable(GOdataVar.MF, elim=MFvar.elim, weight01=MFvar.weight01, lea=MFvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
   MFvar.all[MFvar.all$Significant > MFvar.all$Expected,],
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0005509 calcium ion binding 638 124 64.63 1 2.5e-20 2.5e-20 2.5e-20
2 GO:0004930 G protein-coupled receptor activity 251 36 25.43 2 4.6e-08 4.4e-08 5.5e-10
3 GO:0008061 chitin binding 112 20 11.35 3 1.8e-07 1.8e-07 1.8e-07
4 GO:0043565 sequence-specific DNA binding 221 25 22.39 5 9.2e-05 6.9e-06 9.2e-05
5 GO:0004181 metallocarboxypeptidase activity 39 12 3.95 4 3.8e-05 3.8e-05 3.8e-05
6 GO:0003774 motor activity 277 48 28.06 15 0.00353 8.4e-05 0.00353
8 GO:0005200 structural constituent of cytoskeleton 23 10 2.33 7 0.00015 0.00015 0.00015
9 GO:0005267 potassium channel activity 80 18 8.10 20 0.00552 0.00038 0.00011
10 GO:0030248 cellulose binding 8 2 0.81 9 0.00084 0.00084 0.00084
11 GO:0004555 alpha,alpha-trehalase activity 15 5 1.52 10 0.00096 0.00096 0.00096
12 GO:0016651 oxidoreductase activity, acting on NAD(P… 47 5 4.76 49 0.02485 0.00150 0.02485
13 GO:0004252 serine-type endopeptidase activity 143 29 14.49 11 0.00150 0.00150 0.00150
14 GO:0019205 nucleobase-containing compound kinase ac… 67 12 6.79 47 0.02014 0.00157 0.00149
15 GO:0004553 hydrolase activity, hydrolyzing O-glycos… 169 31 17.12 48 0.02309 0.00186 0.02309
16 GO:0005249 voltage-gated potassium channel activity 54 8 5.47 21 0.00559 0.00193 0.00559
18 GO:0004867 serine-type endopeptidase inhibitor acti… 20 4 2.03 13 0.00321 0.00321 0.00321
19 GO:0005507 copper ion binding 29 8 2.94 14 0.00351 0.00351 0.00351
20 GO:0051015 actin filament binding 47 11 4.76 17 0.00415 0.00415 0.00415
22 GO:0008083 growth factor activity 8 2 0.81 19 0.00450 0.00450 0.00450
23 GO:0005272 sodium channel activity 83 12 8.41 22 0.00627 0.00627 0.00627
24 GO:0005525 GTP binding 334 45 33.84 24 0.00657 0.00657 0.00657
25 GO:0005201 extracellular matrix structural constitu… 7 3 0.71 26 0.00755 0.00755 0.00755
26 GO:0031683 G-protein beta/gamma-subunit complex bin… 17 4 1.72 27 0.00820 0.00820 0.00820
27 GO:0004550 nucleoside diphosphate kinase activity 15 4 1.52 28 0.00837 0.00837 0.00837
28 GO:0004040 amidase activity 5 3 0.51 29 0.00868 0.00868 0.00868
29 GO:0030246 carbohydrate binding 83 20 8.41 31 0.01014 0.00990 0.02445
30 GO:0015267 channel activity 476 68 48.22 543 0.83162 0.01020 0.00020
kable(
   MFvar.all[MFvar.all$Significant < MFvar.all$Expected,],
   caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
7 GO:0008417 fucosyltransferase activity 42 4 4.25 6 0.00013 0.00013 0.00013
17 GO:0005515 protein binding 4270 424 432.59 262 0.32261 0.00273 0.53496
21 GO:0008378 galactosyltransferase activity 24 2 2.43 18 0.00422 0.00422 0.00422
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MFvar.weight01)[sigTerms]),
      caption = paste('Molecular functions terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular functions terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005509 calcium ion binding Interacting selectively and non-covalently with calcium ions (Ca2+). 0.0000000
GO:0004930 G protein-coupled receptor activity Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. 0.0000000
GO:0008061 chitin binding Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0000002
GO:0043565 sequence-specific DNA binding Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. 0.0000069
GO:0004181 metallocarboxypeptidase activity Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. 0.0000382
GO:0003774 motor activity Catalysis of the generation of force resulting either in movement along a microfilament or microtubule, or in torque resulting in membrane scission, coupled to the hydrolysis of a nucleoside triphosphate. 0.0000839
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0001261
GO:0005200 structural constituent of cytoskeleton The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. 0.0001488
GO:0005267 potassium channel activity Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. 0.0003783
GO:0030248 cellulose binding Interacting selectively and non-covalently with cellulose. 0.0008382
GO:0004555 alpha,alpha-trehalase activity Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. 0.0009596
GO:0004252 serine-type endopeptidase activity Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). 0.0015028
GO:0005249 voltage-gated potassium channel activity Enables the transmembrane transfer of a potassium ion by a voltage-gated channel. A voltage-gated channel is a channel whose open state is dependent on the voltage across the membrane in which it is embedded. 0.0019323
GO:0004867 serine-type endopeptidase inhibitor activity Stops, prevents or reduces the activity of serine-type endopeptidases, enzymes that catalyze the hydrolysis of nonterminal peptide bonds in a polypeptide chain; a serine residue (and a histidine residue) are at the active center of the enzyme. 0.0032063
GO:0005507 copper ion binding Interacting selectively and non-covalently with copper (Cu) ions. 0.0035076
GO:0051015 actin filament binding Interacting selectively and non-covalently with an actin filament, also known as F-actin, a helical filamentous polymer of globular G-actin subunits. 0.0041496
GO:0008378 galactosyltransferase activity Catalysis of the transfer of a galactosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0042160
GO:0008083 growth factor activity The function that stimulates a cell to grow or proliferate. Most growth factors have other actions besides the induction of cell growth or proliferation. 0.0044962
GO:0005272 sodium channel activity Enables the facilitated diffusion of a sodium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. 0.0062678
GO:0005525 GTP binding Interacting selectively and non-covalently with GTP, guanosine triphosphate. 0.0065740
GO:0005201 extracellular matrix structural constituent The action of a molecule that contributes to the structural integrity of the extracellular matrix. 0.0075499
GO:0031683 G-protein beta/gamma-subunit complex binding Interacting selectively and non-covalently with a complex of G-protein beta/gamma subunits. 0.0082007
GO:0004550 nucleoside diphosphate kinase activity Catalysis of the reaction: ATP + nucleoside diphosphate = ADP + nucleoside triphosphate. 0.0083664
GO:0004040 amidase activity Catalysis of the reaction: a monocarboxylic acid amide + H2O = a monocarboxylate + NH3. 0.0086755
showSigOfNodes(GOdataVar.MF, score(MFvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 121 
## Number of Edges = 157 
## 
## $complete.dag
## [1] "A graph with 121 nodes."

I plot variance portion, but for the term found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component

orderedTerms <- names(sort(score(CCvar.weight01)))
significant.weight01 <- score(CCvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CCvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CCvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
if (length(sigTerms) == 0) {
   sigTerms <- orderedTerms[significant.lea & significant.elim]
}
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.variance.sigTerms <- sigTerms

CCvar.all <- GenTable(GOdataVar.CC, elim=CCvar.elim, weight01=CCvar.weight01, lea=CCvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=max(sum(significant.elim), 10))

kable(
   CCvar.all[CCvar.all$Significant > CCvar.all$Expected,],
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0005576 extracellular region 197 37 19.46 1 1.4e-09 3.8e-09 7.8e-11
2 GO:0016021 integral component of membrane 1739 215 171.76 2 8.9e-07 4.1e-06 3.7e-08
3 GO:0005886 plasma membrane 324 44 32.00 4 0.00031 6.0e-05 3.2e-05
4 GO:0016459 myosin complex 57 19 5.63 3 6.2e-05 6.2e-05 6.2e-05
5 GO:0016020 membrane 3059 361 302.14 7 0.00116 6.7e-05 1.8e-09
6 GO:0005743 mitochondrial inner membrane 44 13 4.35 5 0.00037 9.3e-05 0.00037
7 GO:0008076 voltage-gated potassium channel complex 30 5 2.96 6 0.00059 0.00059 0.00059
8 GO:0005929 cilium 52 17 5.14 8 0.00289 0.00188 0.00072
9 GO:0031012 extracellular matrix 11 3 1.09 9 0.00323 0.00323 0.00323
10 GO:0031225 anchored component of membrane 8 4 0.79 10 0.00450 0.00450 0.00450
11 GO:0099512 supramolecular fiber 52 13 5.14 12 0.00708 0.00507 0.00708
12 GO:0030990 intraciliary transport particle 5 2 0.49 11 0.00586 0.00586 0.00586
14 GO:0005581 collagen trimer 7 3 0.69 13 0.00723 0.00723 0.00723
15 GO:0005858 axonemal dynein complex 10 3 0.99 14 0.00783 0.00783 0.00783
kable(
   CCvar.all[CCvar.all$Significant < CCvar.all$Expected,],
   caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
13 GO:0090575 RNA polymerase II transcription factor c… 47 3 4.64 99 0.23664 0.00719 0.23664
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CCvar.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005576 extracellular region The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. 0.0000000
GO:0016021 integral component of membrane The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. 0.0000041
GO:0005886 plasma membrane The membrane surrounding a cell that separates the cell from its external environment. It consists of a phospholipid bilayer and associated proteins. 0.0000603
GO:0016459 myosin complex A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. 0.0000621
GO:0016020 membrane A lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it. 0.0000667
GO:0005743 mitochondrial inner membrane The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. 0.0000927
GO:0008076 voltage-gated potassium channel complex A protein complex that forms a transmembrane channel through which potassium ions may cross a cell membrane in response to changes in membrane potential. 0.0005872
GO:0005929 cilium A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. 0.0018827
GO:0031012 extracellular matrix A structure lying external to one or more cells, which provides structural support, biochemical or biomechanical cues for cells or tissues. 0.0032341
GO:0031225 anchored component of membrane The component of a membrane consisting of the gene products that are tethered to the membrane only by a covalently attached anchor, such as a lipid group that is embedded in the membrane. Gene products with peptide sequences that are embedded in the membrane are excluded from this grouping. 0.0044950
GO:0099512 supramolecular fiber A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. 0.0050745
GO:0030990 intraciliary transport particle A nonmembrane-bound oligomeric protein complex that participates in bidirectional transport of molecules (cargo) along axonemal microtubules. 0.0058559
GO:0005581 collagen trimer A protein complex consisting of three collagen chains assembled into a left-handed triple helix. These trimers typically assemble into higher order structures. 0.0072277
GO:0005858 axonemal dynein complex A dynein complex found in eukaryotic cilia and flagella; the motor domain heads interact with adjacent microtubules to generate a sliding force which is converted to a bending motion. 0.0078297
showSigOfNodes(GOdataVar.CC, score(CCvar.weight01),
               firstSigNodes = sum(significant.weight01),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 76 
## Number of Edges = 135 
## 
## $complete.dag
## [1] "A graph with 76 nodes."

For comparison, I plot the distribution of variance portion for the CC term found most significant when using p-values.

showGroupDensity(GOdataVar.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Comparison between the two ordering of genes.

Biological process

allTerms <- usedGO(GOdata.BP)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% BP.pvalue.sigTerms,
                             Variance = allTerms %in% BP.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(BP.weight01))[allTerms],
                         Variance = rank(score(BPvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of BP terms by significance')

Molecular function

allTerms <- usedGO(GOdata.MF)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% MF.pvalue.sigTerms,
                             Variance = allTerms %in% MF.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(MF.weight01))[allTerms],
                         Variance = rank(score(MFvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of MF terms by significance')

Cellular component

allTerms <- usedGO(GOdata.CC)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% CC.pvalue.sigTerms,
                             Variance = allTerms %in% CC.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(CC.weight01))[allTerms],
                         Variance = rank(score(CCvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of CC terms by significance')

Session info

I save the main variables to be able to load them in a new R session later.

save(allgenes2GO,
     GOdata.BP, BP.elim, BP.weight01, BP.lea, BP.pvalue.sigTerms,
     GOdata.MF, MF.elim, MF.weight01, MF.lea, MF.pvalue.sigTerms,
     GOdata.CC, CC.elim, CC.weight01, CC.lea, CC.pvalue.sigTerms,
     GOdataVar.BP, BPvar.elim, BPvar.weight01, BPvar.lea, BP.variance.sigTerms,
     GOdataVar.MF, MFvar.elim, MFvar.weight01, MFvar.lea, MF.variance.sigTerms,
     GOdataVar.CC, CCvar.elim, CCvar.weight01, CCvar.lea, CC.variance.sigTerms,
     file = paste('Enrichment', TAG, VAR, 'RData', sep='.'))
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=ca_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=ca_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=ca_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.30.0     ggplot2_3.2.1        limma_3.42.0        
##  [4] knitr_1.27           topGO_2.38.1         SparseM_1.78        
##  [7] GO.db_3.10.0         AnnotationDbi_1.48.0 IRanges_2.20.2      
## [10] S4Vectors_0.24.3     Biobase_2.46.0       graph_1.64.0        
## [13] BiocGenerics_0.32.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.12          purrr_0.3.3        splines_3.6.2     
##  [5] lattice_0.20-38    colorspace_1.4-1   vctrs_0.2.1        htmltools_0.4.0   
##  [9] yaml_2.2.0         mgcv_1.8-31        blob_1.2.1         rlang_0.4.2       
## [13] pillar_1.4.3       glue_1.3.1         withr_2.1.2        DBI_1.1.0         
## [17] bit64_0.9-7        matrixStats_0.55.0 lifecycle_0.1.0    stringr_1.4.0     
## [21] munsell_0.5.0      gtable_0.3.0       memoise_1.1.0      evaluate_0.14     
## [25] labeling_0.3       highr_0.8          Rcpp_1.0.3         backports_1.1.5   
## [29] scales_1.1.0       farver_2.0.3       bit_1.1-15.1       digest_0.6.23     
## [33] stringi_1.4.5      dplyr_0.8.3        tools_3.6.2        magrittr_1.5      
## [37] lazyeval_0.2.2     tibble_2.1.3       RSQLite_2.2.0      crayon_1.3.4      
## [41] pkgconfig_2.0.3    zeallot_0.1.0      Matrix_1.2-18      assertthat_0.2.1  
## [45] rmarkdown_2.1      R6_2.4.1           nlme_3.1-143       compiler_3.6.2